Pset 5 - Plotly, Shiny, Interactions, Splines - Part 1 and 2

library(leaflet)
library(sf)
library(sp)
library(splines)
library(tidyverse)
library(pubtheme)
library(readr)

Objectives

Choosing a home location

A prospective homeowner is interested in buying a house in Connecticut. Since she works in New York City near Grand Central Terminal and her partner works in New Haven, she has the following preferences for the location of her new home:

  • between NYC and New Haven,
  • relatively close to a Metro-North train station,
  • special preference towards stations with many options for convenient trains traveling to Grand Central Station and New Haven.

She also prefers to live in a town where

  • the house prices aren’t that high, and
  • the population is relatively diverse, and where at least 5% of the population is Asian.

1. Visualization

Create an interactive visualization or visualizations that she can use to help her make her decision. Use the data below, which contains

  • US Census data (same as what we previously used in class)
  • Latitude/longitude for Metro-North stations.
  • For every pair of stations, information about trips between those two stations: number of trips, the mean duration, and the minimum duration. This has been filtered so that only trips to/from Grand Central, to/from New Haven, and along the train lines that are in Connecticut (New Haven Line, Danbury Line, Waterbury Line, New Canaan Line) remain.

Here is the data with some light cleaning to help you get started.

Census data

dc = readRDS('data/tracts.and.census.with.EV.stations.rds')
dc = dc[dc$state == 'CT',]
head(dc@data,2)
      STATEFP COUNTYFP TRACTCE             AFFGEOID       GEOID NAME LSAD
12844      09      001  090300 1400000US09001090300 09001090300  903   CT
12845      09      001  090400 1400000US09001090400 09001090400  904   CT
        ALAND AWATER  meters    miles state tract           county  state.full
12844 4764507      0 4764507 1.839586    CT   903 Fairfield County Connecticut
12845 7347827      0 7347827 2.837012    CT   904 Fairfield County Connecticut
       pop male female  age male.age female.age white black indian.alaskan
12844 4611 2333   2278 42.9     42.7       43.1  4230    24              0
12845 6518 3355   3163 40.6     36.5       45.7  4742   654             78
      asian pacific other two.or.more white.not.hisp hisp white.hisp black.hisp
12844   180       0    66         111           3888  435        342          0
12845   524      24    88         408           4324  613        418          0
      households i10orless i10to14 i15to19 i20to24 i25to29 i30to34 i35to39
12844       1550        56       8      22      34      16      12      31
12845       2140        24      18      85       0      89      36      49
      i40to44 i45to49 i50to59 i60to74 i75to99 i100to124 i125to149 i150to199
12844      19      43      59      67     231       204       197       261
12845       0      94      24     128     219       343       233       421
      i200ormore hh.income house.value   male.p female.p  white.p    black.p
12844        290    118819      372100 50.59640 49.40360 91.73715  0.5204945
12845        377    121802      344600 51.47284 48.52716 72.75238 10.0337527
       asian.p   hisp.p white.not.hisp.p white.hisp.p black.hisp.p  other.p
12844 3.903709 9.433962         84.32010     7.417046            0 3.838647
12845 8.039276 9.404725         66.33937     6.413010            0 9.174593
      rescaled.house.value hh.income.and.house tot.hh.income tot.house.value
12844             80487.25            99653.12     184169450       576755000
12845             76759.97            99280.99     260656280       737444000
      tot.hh.income.and.house pop.density hh.density income.density
12844               154462344    2506.542   842.5807      100114594
12845               212461309    2297.488   754.3148       91877050
      house.value.density house.and.income.density lev2 lev3
12844           313524273                 83965798    2    4
12845           259936875                 74889116   NA   NA

Train stations

st = read.csv('data/stops.txt')
st = st %>%
  select(stop_id, stop_code, stop_name, stop_lat, stop_lon)
head(st)
  stop_id stop_code          stop_name stop_lat  stop_lon
1       1       0NY      Grand Central 40.75300 -73.97706
2       4       0HL      Harlem-125 St 40.80516 -73.93915
3     622       0YS   Yankees-E 153 St 40.82530 -73.92990
4     184       0HI    Highbridge Yard 40.83590 -73.93150
5       9       0MH     Morris Heights 40.85425 -73.91958
6      10       0UH University Heights 40.86225 -73.91312

Trips

tr = readRDS('data/trips.rds')
tr = tr %>% 
  arrange(stop1, stop2)
head(tr, 10)



# Read the RDS file
data <- readRDS("data/trips.rds")  # Change the filename accordingly

write_csv(data, "trips.csv")  # Saves as CSV

readLines("data/stops.txt", n = 5)
         stop1   tf         stop2    n      mean min
1       Bethel from Grand Central  246 110.19919 104
2       Bethel   to Grand Central  188 117.61702 112
3  Branchville from Grand Central  246  94.19919  88
4  Branchville   to Grand Central  188 100.61702  95
5   Bridgeport from Grand Central 3024  90.20602  73
6   Bridgeport   to Grand Central 2920  98.40651  77
7   Bridgeport   to     New Haven 3066  34.35486  25
8   Bridgeport from     New Haven 2804  25.17511  20
9   Cannondale from Grand Central  246  87.19919  81
10  Cannondale   to Grand Central  188  93.61702  88
[1] "stop_id,stop_code,stop_name,stop_desc,stop_lat,stop_lon,zone_id,stop_url,location_type,parent_station,wheelchair_boarding"
[2] "1,0NY,Grand Central,,40.752998,-73.977056,,http://as0.mta.info/mnr/stations/station_detail.cfm?key=1,0,,1"                
[3] "4,0HL,Harlem-125 St,,40.805157,-73.939149,,http://as0.mta.info/mnr/stations/station_detail.cfm?key=2,0,,1"                
[4] "622,0YS,Yankees-E 153 St,,40.8253,-73.9299,,http://as0.mta.info/mnr/stations/station_detail.cfm?key=3,0,,1"               
[5] "184,0HI,Highbridge Yard,,40.8359,-73.9315,,http://as0.mta.info/mnr/stations/station_detail.cfm?key=,0,,0"                 

Visualizations

Create your visualization(s) here.

library(sf)
library(ggplot2)
library(dplyr)
library(ggpattern) 

dg = st_as_sf(dc)
head(dg,2)
head(dg$geometry, 2)

mean_house_value <- mean(dg$house.value, na.rm = TRUE)

dg$diversity_color <- as.factor(ifelse(dg$asian.p > 5, "Diverse", "Not Diverse"))
dg$house_value_below_mean <- as.factor(ifelse(dg$house.value < mean_house_value, "Below Mean", "Above Mean"))

trips <- tr %>% 
  left_join(st, by = c("stop1" = "stop_name")) %>%
  left_join(st, by = c("stop2" = "stop_name"), suffix = c(".1",".2")) %>%
  filter(!is.na(stop_lat.1), !is.na(stop_lat.2), !is.na(stop_lon.2))
Simple feature collection with 2 features and 73 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -73.24856 ymin: 41.22053 xmax: -73.17855 ymax: 41.25491
Geodetic CRS:  NAD83
      STATEFP COUNTYFP TRACTCE             AFFGEOID       GEOID NAME LSAD
12850      09      001  090300 1400000US09001090300 09001090300  903   CT
12851      09      001  090400 1400000US09001090400 09001090400  904   CT
        ALAND AWATER  meters    miles state tract           county  state.full
12850 4764507      0 4764507 1.839586    CT   903 Fairfield County Connecticut
12851 7347827      0 7347827 2.837012    CT   904 Fairfield County Connecticut
       pop male female  age male.age female.age white black indian.alaskan
12850 4611 2333   2278 42.9     42.7       43.1  4230    24              0
12851 6518 3355   3163 40.6     36.5       45.7  4742   654             78
      asian pacific other two.or.more white.not.hisp hisp white.hisp black.hisp
12850   180       0    66         111           3888  435        342          0
12851   524      24    88         408           4324  613        418          0
      households i10orless i10to14 i15to19 i20to24 i25to29 i30to34 i35to39
12850       1550        56       8      22      34      16      12      31
12851       2140        24      18      85       0      89      36      49
      i40to44 i45to49 i50to59 i60to74 i75to99 i100to124 i125to149 i150to199
12850      19      43      59      67     231       204       197       261
12851       0      94      24     128     219       343       233       421
      i200ormore hh.income house.value   male.p female.p  white.p    black.p
12850        290    118819      372100 50.59640 49.40360 91.73715  0.5204945
12851        377    121802      344600 51.47284 48.52716 72.75238 10.0337527
       asian.p   hisp.p white.not.hisp.p white.hisp.p black.hisp.p  other.p
12850 3.903709 9.433962         84.32010     7.417046            0 3.838647
12851 8.039276 9.404725         66.33937     6.413010            0 9.174593
      rescaled.house.value hh.income.and.house tot.hh.income tot.house.value
12850             80487.25            99653.12     184169450       576755000
12851             76759.97            99280.99     260656280       737444000
      tot.hh.income.and.house pop.density hh.density income.density
12850               154462344    2506.542   842.5807      100114594
12851               212461309    2297.488   754.3148       91877050
      house.value.density house.and.income.density lev2 lev3
12850           313524273                 83965798    2    4
12851           259936875                 74889116   NA   NA
                            geometry
12850 MULTIPOLYGON (((-73.24539 4...
12851 MULTIPOLYGON (((-73.22186 4...
Geometry set for 2 features 
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -73.24856 ymin: 41.22053 xmax: -73.17855 ymax: 41.25491
Geodetic CRS:  NAD83
# Convert stops to sf (filter out missing coordinates)
stations_sf <- st %>%
  filter(!is.na(stop_lon) & !is.na(stop_lat)) %>%
  st_as_sf(coords = c("stop_lon", "stop_lat"), crs = st_crs(dg))

# Aggregate number of stops per region (filter out missing stops)
stop_counts <- tr %>%
  count(stop1) %>%
  rename(num_stops = n) %>%
  left_join(st, by = c("stop1" = "stop_name")) %>%
  filter(!is.na(stop_lon) & !is.na(stop_lat))  # Remove missing coordinates

stop_counts_sf <- stop_counts %>%
  st_as_sf(coords = c("stop_lon", "stop_lat"), crs = st_crs(dg))

# Visualization
g <- ggplot() + 
  # Base layer: Fill color for Diversity, Border color for House Value
  geom_sf(data = dg, aes(fill = diversity_color, color = house_value_below_mean), size = 0.5, show.legend = TRUE) +
  
  # Overlay stops with size based on number of stops (Larger Dots = More Stops)
  geom_sf(data = stop_counts_sf, aes(size = (1/num_stops)), color = "purple", alpha = 0.7) +
  
  # Custom fill and color scales
  scale_fill_manual(values = c("Diverse" = "blue", "Not Diverse" = "grey50"), name = "Diversity (Asian > 5%)") + 
  scale_color_manual(values = c("Below Mean" = "red", "Above Mean" = "black"), name = "House Value Stats") +
  scale_size_continuous(range = c(4, 1), name = "Number of Stops") +  # Now bigger dots = more stops
  
  # Titles and theme
  labs(title = "Public Transport, Diversity & Housing",
       subtitle = "Number of stops, diversity, and housing value",
       caption = "Purple Dots: Transit Stops (Bigger Dots = More Stops)") + 
  theme_minimal()

g

Make it interactive:

library(leaflet)
library(dplyr)
library(sf)

# Convert stops data to sf for Leaflet
stations_sf <- st %>%
  filter(!is.na(stop_lon) & !is.na(stop_lat)) %>%
  st_as_sf(coords = c("stop_lon", "stop_lat"), crs = st_crs(dg))

# Aggregate number of stops per region (filter out missing stops)
stop_counts <- tr %>%
  count(stop1) %>%
  rename(num_stops = n) %>%
  left_join(st, by = c("stop1" = "stop_name")) %>%
  filter(!is.na(stop_lon) & !is.na(stop_lat))  # Remove missing coordinates

# Convert aggregated stops into sf
stop_counts_sf <- stop_counts %>%
  st_as_sf(coords = c("stop_lon", "stop_lat"), crs = st_crs(dg))

# Create the interactive Leaflet map
leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%  # Light background map

  # Add town boundaries with diversity and house value color codes
  addPolygons(data = dg, 
              fillColor = ~ifelse(diversity_color == "Diverse", "blue", "grey"),
              color = ~ifelse(house_value_below_mean == "Below Mean", "red", "black"),
              weight = 1, fillOpacity = 0.5,
              popup = ~paste0("<b>Town:</b> ", NAME, "<br>",
                              "<b>Diversity (Asian >5%):</b> ", round(asian.p, 2), "%<br>",
                              "<b>House Value:</b> ", ifelse(house_value_below_mean == "Below Mean", "Below Mean", "Above Mean"))) %>%

  # Add train stations with size based on number of stops
  addCircleMarkers(data = stop_counts_sf,
                   radius = ~sqrt(num_stops) * 3,  # Bigger dots for more stops
                   color = "purple", fillOpacity = 0.7,
                   popup = ~paste0("<b>Station:</b> ", stop1, "<br>",
                                   "<b>Number of Stops:</b> ", num_stops)) %>%

  # Add legend for Diversity
  addLegend(position = "topright",
            colors = c("blue", "grey"),
            labels = c("Diverse (Asian >5%)", "Not Diverse"),
            title = "Diversity") %>%

  # Add legend for House Prices
  addLegend(position = "bottomright",
            colors = c("red", "black"),
            labels = c("Below Mean", "Above Mean"),
            title = "House Value Stats")

2. Explanation

Briefly explain the choices you made for your visualization(s). What do you think the prospective homeowner will find most useful? Are there any locations in particular that stand out?

I chose to show the homeowner’s three main priorities: diversity, number of stops, and house value. In depicting these three choices with different mechanisms— the outline, size of purple dots, and filled in color, I believe this is an effective way to communicate a lot of information on one graph. I believe the houseowner will find this useful. The locations that first stand out are in the very center or slightly north of the state, as the house value stats are cheaper and the diversity is higher, however they have no stops that connect directly to grand central. So the options further south-west, at the tail end of the purple dots seem to be the best options for the homeowner.

Shiny

3. Basic Shiny App

Do the following to create a basic shiny app and publish it to shinyapps.io

  • Configure R Studio to communicate with shinyapps.io by following the directions from this page https://shiny.rstudio.com/articles/shinyapps.html up to and including the “Method 1” section. You will show you how to install rsconnect, create a shinyapps.io account, and set up communication between R Studio and shinyapps.io. Note that you will only have to do this once.
  • Create a new shiny app like we did in class by clicking File, New File, Shiny Web App. Name the app pset5app1, choose Single File, choose an appropriate directory, and click Create. An app.R file will open in R Studio.
  • Click Run App to see the app. The app should appear in the Viewer tab in R Studio, or in its own window.
  • In the upper right of the app, click Publish. Make sure the correct files and account are selected. Click Publish in the lower right.

Paste the url for your app here. The url will be of the form https://yourusername.shinyapps.io/pset5app1/

url here

https://pearlmallick.shinyapps.io/pset5app1/

        id                                                    name
1 14025584                                               pset5app1
2 14026173                                              pset5_app2
3 14026338 pset5-property-values-interactions-splines-shiny-plotly
4 14030700                                                    app2
                                                                                         url
1                                               https://pearlmallick.shinyapps.io/pset5app1/
2                                              https://pearlmallick.shinyapps.io/pset5_app2/
3 https://pearlmallick.shinyapps.io/pset5-property-values-interactions-splines-shiny-plotly/
4                                                    https://pearlmallick.shinyapps.io/app2/
      status        created_time        updated_time  size instances guid title
1   sleeping 2025-02-18T02:44:12 2025-02-18T15:01:54 large        NA   NA  <NA>
2 terminated 2025-02-18T04:18:40 2025-02-18T15:59:12 large        NA   NA  <NA>
3   sleeping 2025-02-18T04:49:19 2025-02-18T15:42:55 large        NA   NA  <NA>
4    running 2025-02-18T16:08:22 2025-02-18T16:18:41 large        NA   NA  <NA>
                                             config_url
1 https://www.shinyapps.io/admin/#/application/14025584
2 https://www.shinyapps.io/admin/#/application/14026173
3 https://www.shinyapps.io/admin/#/application/14026338
4 https://www.shinyapps.io/admin/#/application/14030700

4. Shiny App for choosing home location

Create a Shiny app that contains visualization you made in #1, and add at least one widget that allow the user to investigate the data by customizing the visualization in some useful way. For example, you could add a sliderInput or selectInput that helps you filter the data, or a selectInput that allows the user to choose how to color the points, etc.

Different types of user inputs can be found here https://shiny.rstudio.com/gallery/widget-gallery.html

Publish your shiny app to shinyapps.io, and paste the url for your app here:

url here

https://pearlmallick.shinyapps.io/app2/

Summarize in a sentence why you choose those user inputs and what you learned about the data from the app.

I added a slider input to filter towns by house value, a checkbox to show only towns with greater than 5% asian, and a dropdown (select input) to choose a Metro-North station. These inputs allow users to interactively explore affordability, diversity, and public transit accessibility, helping them make informed decisions about where to live. I learned from this data very quickly and easily that there are a few places, in the very middle of CT that would work perfectly for the homeowner, by using the slider and selecting the criteria.

CT Property data

First let’s load the CT property and keep only the New Haven properties.

d = readRDS('data/coast.properties.rds')
d = st_drop_geometry(d)

## I'm gonna change some column names 
colnames(d) = gsub('_', '.', colnames(d))

colnames(d)[colnames(d) == 'Assessed.Total'] = 'value'
colnames(d)[colnames(d) == 'Number.of.Bedroom'] = 'beds'
colnames(d)[colnames(d) == 'Number.of.Baths'] = 'baths'
colnames(d)[colnames(d) == 'Number.of.Half.Baths'] = 'half.baths'
colnames(d)[colnames(d) == 'Living.Area'] = 'living'
colnames(d)[colnames(d) == 'SHAPE.Area'] = 'land'
colnames(d)[colnames(d) == 'Condition.Description'] = 'cond'
colnames(d)[colnames(d) == 'State.Use.Description'] = 'use'
colnames(d)[colnames(d) == 'GlobalID'] = 'id'
colnames(d)[colnames(d) == 'Mailing.Address'] = 'address'
colnames(d)[colnames(d) == 'Mailing.City'] = 'city'
colnames(d)[colnames(d) == 'Town.Name'] = 'town'
colnames(d)[colnames(d) == 'Valuation.Year'] = 'val.year'
colnames(d)[colnames(d) == 'Sale.Price'] = 'sale.price'
colnames(d)[colnames(d) == 'Sale.Date'] = 'sale.date'
colnames(d)[colnames(d) == 'AYB'] = 'ayb'
colnames(d)[colnames(d) == 'EYB'] = 'eyb'

d = d %>%
  
  filter(value != 0, 
         grepl('Single Fam|SINGLE FAM|One Fam|ONE FAM', 
               use), 
         Qualified == 'Q', 
         sale.price != 0, 
         !is.na(living), 
         living != 0, 
         Condition %in% c( 'E', 'EX',             ## excellent
                           'VG', 'G', 'GD',       ## very good, good
                           'A+', 'A', 'AV', 'A-', ## avg+, avg, avg-
                           'F', 'FR',             ## fair
                           'P', 'VP' ),           ## poor, v poor
         !duplicated(id)) %>%       
  
  mutate(cond = case_when(cond == 'EX' ~ 'Excellent', 
                          cond == 'Avarage' ~ 'Average', 
                          cond == 'AV' ~ 'Average',
                          cond == 'F' ~ 'Fair', 
                          cond == 'FR' ~ 'Fair', 
                          cond == 'GD' ~ 'Good', 
                          cond == 'G+' ~ 'Very Good', 
                          TRUE ~ cond), 
         
         dist = as.numeric(dist), 
         land = ifelse(land < 0, -land, land)) %>%
  
  select(value, living, land, 
         beds, baths, half.baths, 
         cond, use, dist, ayb, eyb, 
         sale.price, sale.date,
         address, city, town, id, 
         centroid.x, centroid.y, 
         point.x, point.y)

d = d %>% 
  filter(town == 'NEW HAVEN')

5. Data exploration with ggplotly and plotly

Recall we noticed what looks like two distinct clouds of points in this visualization.

library(ggplot2)
library(plotly)
library(dplyr)

# Create scatter plot with enhanced tooltips
g <- ggplot(d, aes(x = living, y = log(value), 
                   text = paste(
                     "<b>Town:</b>", town,
                     "<br><b>Living Area:</b>", living, "sqft",
                     "<br><b>Land Size:</b>", land, "sqft",
                     "<br><b>Bedrooms:</b>", beds,
                     "<br><b>Bathrooms:</b>", baths,
                     "<br><b>Use:</b>", use,
                     "<br><b>Address:</b>", address,
                     "<br><b>Sale Price:</b>", sale.price,
                     "<br><b>Property Value:</b>", value
                   ))) +
  geom_point(color = "navy", alpha = 0.7) +  # Set point color to navy
  labs(title = "Living Area vs Log(Property Value)", 
       x = "Living Area (sqft)", y = "Log(Property Value)") +
  theme_minimal()

# Convert to interactive plot with plotly
ggplotly(g, tooltip = "text")

Use ggplotly or plotly to make a interactive scatter plot of log(value) vs living. Add color, tooltip, or other features that help you investigate why there appears to be two distinct clouds of points. What characteristics do the properties in the upper right cluster of points have in common?

In the upper right cluster of points, they are all more bedrooms and more expensive. Additionally, all the top right cluster addresses are closer to Yale’s campus based off of the addresses in the tooltip, and the addresses on the left-hand bottom part of the graph are streets further away from Yale’s campus. This sort of clustering is super interesting but makes intuitive sense.

6. Data exploration with leaflet

Use leaflet to make an interactive map of the properties. Add color, tooltips, or other features that helps you further investigate why there are two clouds of points. Does this map confirm your conclusions from the previous question?

# Load leaflet library
library(leaflet)

# Define color palette based on property sale price
palette <- colorNumeric(palette = "viridis", domain = d$sale.price)

# Create leaflet map
m <- leaflet(d) %>%
  addTiles() %>%  # Add basemap
  addCircleMarkers(
    lng = ~point.x, lat = ~point.y,
    radius = 4,  # Marker size
    color = ~palette(sale.price),  # Color by sale price
    fillOpacity = 0.7,
    popup = ~paste("<strong>Address:</strong>", address,
                   "<br><strong>Living Area:</strong>", living, 
                   "<br><strong>Land Size:</strong>", land, 
                   "<br><strong>Beds:</strong>", beds, 
                   "<br><strong>Baths:</strong>", baths, 
                   "<br><strong>Sale Price:</strong>", sale.price, 
                   "<br><strong>Year Built:</strong>", ayb)
  ) %>%
  addLegend("bottomright", pal = palette, values = d$sale.price, 
            title = "Property Value", opacity = 1)

# Print map
m

This does support my previous conclusions! The area right near Yale appear to have more green-tinted dots, while the large majority of the areas around are purple. The tooltip in the Leaflet map includes key property details such as address, living area, land size, number of beds and baths, sale price, and year built, providing a quick summary of each property. The color of the markers reflects the property’s sale price, using a continuous gradient from the “viridis” palette to visually distinguish higher and lower values. Additionally, the legend in the bottom right corner dynamically updates to display the range of property values, helping users interpret the map more easily.